Homework 1¶
What this repo contains¶
This repository contains statistics and charts for an obesity dataset, on which we performed univariate and multivariate analysis.
From the univariate analysis we can obtain the following details:
- Age is leptokurtic and right-skewed. This indicates that our dataset is focused on a younger population.
- Height and Weight are mesokurtic and symmetric. This indicates that our dataset has a wide variation in these attributes.
- Many of the surveyed individuals already have a family member who faces this issue, suggesting that obesity may also have a genetic component.
- Many of the surveyed individuals frequently consume high-calorie foods.
From the multivariate analysis we can obtain the following:
- Based on the scatter plots (screenshot #1 below) and boxplots (screenshots #2 and #3 below), the attributes with the highest predictive power are:
- Height
- Weight
- Age
- Based on the ANOVA score (screenshot #4 below), we also observe that FCVC (frequency of consumption of vegetables) is inversely correlated with obesity, so vegetarians might fall into the normal-weight category.
- Many of the surveyed individuals who consume high-calorie foods don't smoke.
- Many of the surveyed individuals who consume high-calorie foods don't monitor their calorie intake.
- Many of the surveyed individuals with low weight (and likely low height) may be anorexic (source: stacked histograms).
Screenshots¶
Screenshot #1: Scatter-plot based on height, weight, and obesity prediction¶
- The scatter plot shows that, using Height and Weight, the data could be easily separable

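The separability seen in the scatter plot is expected: BMI, the standard obesity index, depends only on these two attributes. A minimal sketch (the function and thresholds below are the standard WHO cut-offs, for illustration only, not derived from this dataset):

```python
def bmi_category(weight_kg: float, height_m: float) -> str:
    """Map weight and height to a coarse WHO BMI category."""
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        return "Insufficient_Weight"
    if bmi < 25:
        return "Normal_Weight"
    if bmi < 30:
        return "Overweight"
    return "Obesity"

# e.g. 86 kg at 1.70 m -> BMI ~ 29.8 -> "Overweight"
print(bmi_category(86, 1.70))
```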
Screenshot #2: Boxplot based on weight and obesity prediction¶
- The boxplot shows that the label variable separates the numerical attribute Weight quite well.

Screenshot #3: Boxplot based on age and obesity prediction¶
- This boxplot helps us separate Obesity_Type_II and Obesity_Type_III using the Age attribute. Separating these two classes using the Weight attribute alone is more difficult.

Screenshot #4: ANOVA score¶
- The ANOVA test confirms what the plots above show regarding the Weight, Height and Age variables. It also shows that FCVC has good predictive power.

Screenshot #5: CAEC, Gender, family_history_with_overweight, CALC¶

- The CHI2 score shows that CAEC, Gender, family_history_with_overweight and CALC are the categorical attributes most strongly associated with the label.
Dependencies¶
!pip install ipympl
!pip install plotly
import csv
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib
import seaborn as sns
import typing as t
import numpy as np
import numpy.typing as npt
import itertools
import pandas as pd
from scipy import stats as scpy_stats
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import plotly
import plotly.graph_objs as go
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
plotly.offline.init_notebook_mode()
%matplotlib widget
%matplotlib inline
class Person:
    Gender: str
    Age: float
    Height: float
    Weight: float
    family_history_with_overweight: str
    FAVC: str
    FCVC: float
    NCP: float
    CAEC: str
    SMOKE: str
    CH2O: float
    SCC: str
    FAF: float
    TUE: float
    CALC: str
    MTRANS: str
    NObeyesdad: str

    def __init__(
        self,
        Gender: str,
        Age: float,
        Height: float,
        Weight: float,
        family_history_with_overweight: str,
        FAVC: str,
        FCVC: float,
        NCP: float,
        CAEC: str,
        SMOKE: str,
        CH2O: float,
        SCC: str,
        FAF: float,
        TUE: float,
        CALC: str,
        MTRANS: str,
        NObeyesdad: str,
    ):
        self.Gender = Gender
        self.Age = Age
        self.Height = Height
        self.Weight = Weight
        self.family_history_with_overweight = family_history_with_overweight
        self.FAVC = FAVC
        self.FCVC = FCVC
        self.NCP = NCP
        self.CAEC = CAEC
        self.SMOKE = SMOKE
        self.CH2O = CH2O
        self.SCC = SCC
        self.FAF = FAF
        self.TUE = TUE
        self.CALC = CALC
        self.MTRANS = MTRANS
        self.NObeyesdad = NObeyesdad

    def __str__(self):
        return (
            "{"
            + f'"Gender": "{self.Gender}",'
            + f'"Age": {self.Age},'
            + f'"Height": {self.Height},'
            + f'"Weight": {self.Weight},'
            + f'"family_history_with_overweight": "{self.family_history_with_overweight}",'
            + f'"FAVC": "{self.FAVC}",'
            + f'"FCVC": {self.FCVC},'
            + f'"NCP": {self.NCP},'
            + f'"CAEC": "{self.CAEC}",'
            + f'"SMOKE": "{self.SMOKE}",'
            + f'"CH2O": {self.CH2O},'
            + f'"SCC": "{self.SCC}",'  # quoted: SCC holds "yes"/"no"
            + f'"FAF": {self.FAF},'    # unquoted: FAF is numeric
            + f'"TUE": {self.TUE},'
            + f'"CALC": "{self.CALC}",'
            + f'"MTRANS": "{self.MTRANS}",'
            + f'"NObeyesdad": "{self.NObeyesdad}"'
            + "}"
        )

    def __len__(self):
        return 17

    __repr__ = __str__
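The field declarations, __init__ and JSON-style __str__ could also be generated automatically with the standard-library dataclasses module; a sketch of the same idea, shown on a hypothetical subset of the fields for brevity:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PersonCompact:
    # Illustrative subset of the Person fields above.
    Gender: str
    Age: float
    Height: float
    Weight: float

    def __str__(self) -> str:
        # json.dumps replaces the hand-written f-string concatenation.
        return json.dumps(asdict(self))

print(PersonCompact("Female", 21.0, 1.62, 64.0))
```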
NUMERICAL_VARIABLES = ["Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE"]
CATEGORICAL_VARIABLES = ["FAVC", "CAEC", "CALC", "SCC", "MTRANS", "Gender", "family_history_with_overweight", "SMOKE", "NObeyesdad"]
CATEGORICAL_VARIABLES_NO_LABEL = ["FAVC", "CAEC", "CALC", "SCC", "MTRANS", "Gender", "family_history_with_overweight", "SMOKE"]
LABEL_VARIABLE = "NObeyesdad"
class DatasetManager:
    def __init__(self, path_to_csv: str):
        self.path_to_csv = path_to_csv

    def load_as_obj_list(self) -> list[Person]:
        with open(self.path_to_csv) as csv_file:
            csv_reader = csv.DictReader(csv_file)
            return [Person(**row) for row in csv_reader]

    @staticmethod
    def obj_list_to_dataseries(
        data: list[Person],
        attrs_list: list[str] = NUMERICAL_VARIABLES + CATEGORICAL_VARIABLES,
    ) -> t.List[t.List[np.float32]]:
        return [
            [np.float32(getattr(entry, field)) for entry in data]
            for field in attrs_list
        ]

    @staticmethod
    def obj_list_to_flat_np_array(
        data: list[Person],
        attrs_list: list[str] = NUMERICAL_VARIABLES + CATEGORICAL_VARIABLES,
    ) -> npt.NDArray[np.float32]:
        return np.array(
            [
                np.float32(getattr(entry, field))
                for entry in data
                for field in attrs_list
            ]
        )

    @staticmethod
    def obj_list_to_list(
        data: t.List[Person], attrs_list: t.List[str] = CATEGORICAL_VARIABLES
    ) -> t.List[t.List[str]]:
        return [[getattr(entry, field) for entry in data] for field in attrs_list]

    @staticmethod
    def obj_list_to_flat_list(
        data: t.List[Person], attrs_list: t.List[str] = CATEGORICAL_VARIABLES
    ) -> t.List[str]:
        return [getattr(entry, field) for field in attrs_list for entry in data]

    @staticmethod
    def make_combinations(
        attrs_list: t.List[str], k: int = 2
    ) -> t.List[t.Tuple[str, ...]]:
        return [subset for subset in itertools.combinations(attrs_list, k)]

    @staticmethod  # was missing; all helpers here are static
    def obj_list_to_np_array(
        data: list[Person],
        attrs_list: list[str] = NUMERICAL_VARIABLES + CATEGORICAL_VARIABLES,
    ) -> np.ndarray:
        return np.array(
            [[getattr(entry, field) for field in attrs_list] for entry in data]
        )

    @staticmethod
    def obj_list_to_np_array_numeric(
        data: list[Person],
        attrs_list: list[str] = NUMERICAL_VARIABLES + CATEGORICAL_VARIABLES,
    ) -> np.ndarray:
        return DatasetManager.obj_list_to_np_array(data, attrs_list).astype(np.float64)

    @staticmethod
    def obj_list_to_np_array_category(
        data: list[Person],
        attrs_list: list[str] = NUMERICAL_VARIABLES + CATEGORICAL_VARIABLES,
    ) -> np.ndarray:
        data_pd = pd.DataFrame(DatasetManager.obj_list_to_np_array(data, attrs_list))
        return np.vstack([pd.factorize(data_pd[col])[0] for col in data_pd.columns]).T

    @staticmethod
    def obj_list_to_np_array_category_binary(
        data: list[Person], attrs_list: list[str]
    ) -> np.ndarray:
        data_pd = pd.DataFrame(DatasetManager.obj_list_to_np_array(data, attrs_list))
        return pd.get_dummies(data_pd).to_numpy().astype(np.float64)

    @staticmethod
    def object_list_to_pd_dataframe_category(
        data: list[Person], attr: str
    ) -> pd.Series:
        return pd.DataFrame(
            DatasetManager.obj_list_to_np_array(data, [attr])
        ).value_counts()

    @staticmethod
    def object_list_to_pd_dataframe_contingency_table(
        data: list[Person], lh_attr: str, rh_attr: str
    ) -> pd.DataFrame:
        data_pd = pd.DataFrame(
            DatasetManager.obj_list_to_np_array(data, [lh_attr, rh_attr]),
            columns=[lh_attr, rh_attr],
        )
        return pd.crosstab(data_pd[lh_attr], data_pd[rh_attr], margins=False)

    @staticmethod
    def get_numerical_for_each_category(
        data: list[Person], numerical_attr: str, categorical_attr: str
    ) -> list[tuple[np.ndarray, str]]:
        data_np = DatasetManager.obj_list_to_np_array(
            data, [numerical_attr, categorical_attr]
        )
        unique_category_values = np.unique(data_np[:, 1]).tolist()
        return [
            (data_np[data_np[:, 1] == unique_val, 0].astype(np.float64), unique_val)
            for unique_val in unique_category_values
        ]
dataset_manager = DatasetManager("data/ObesityDataSet.csv")
dataset_obj_list = dataset_manager.load_as_obj_list()
Univariate analysis¶
The following will be applied:
- central tendency
- spread
- distribution form (skewness, kurtosis)
- frequency of categorical data
- graphs
- histograms
- density
- boxplots
1. Central tendency¶
Calculates mean, median and mode for each data series.
def calculate_central_tendency_numerical(
    np_dataset: np.array,
) -> t.Tuple[float, float, float]:
    mean = np.mean(np_dataset)
    median = np.median(np_dataset)
    mode = scpy_stats.mode(np_dataset).mode
    return mean, median, mode

for numerical_var in NUMERICAL_VARIABLES:
    dataset_for_numerical_val = DatasetManager.obj_list_to_flat_np_array(
        dataset_obj_list, [numerical_var]
    )
    mean, median, mode = calculate_central_tendency_numerical(dataset_for_numerical_val)
    print(f"On numerical var {numerical_var}")
    print(f"Mean: {mean}")
    print(f"Median: {median}")
    print(f"Mode: {mode}\n{'-' * 10}")
# Note: the mode is arguably not meaningful for continuous numerical variables.
On numerical var Age
Mean: 24.312599182128906
Median: 22.777889251708984
Mode: 18.0
----------
On numerical var Height
Mean: 1.7016774415969849
Median: 1.7004990577697754
Mode: 1.7000000476837158
----------
On numerical var Weight
Mean: 86.58605194091797
Median: 83.0
Mode: 80.0
----------
On numerical var FCVC
Mean: 2.4190428256988525
Median: 2.3855020999908447
Mode: 3.0
----------
On numerical var NCP
Mean: 2.6856281757354736
Median: 3.0
Mode: 3.0
----------
On numerical var CH2O
Mean: 2.0080113410949707
Median: 2.0
Mode: 2.0
----------
On numerical var FAF
Mean: 1.0102977752685547
Median: 1.0
Mode: 0.0
----------
On numerical var TUE
Mean: 0.6578659415245056
Median: 0.6253499984741211
Mode: 0.0
----------
2. Spread¶
Calculates the spread of data for each data series. Useful to know whether the data has a "central tendency".
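As a quick illustration of these three measures on a hand-made toy array (not from the dataset): the standard deviation is the square root of the variance, and the range is max minus min.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(np.ptp(x))  # range: 9 - 2 = 7.0
print(np.var(x))  # population variance: 4.0
print(np.std(x))  # standard deviation: sqrt(4.0) = 2.0
```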
def calculate_spread_numerical(np_dataset: np.array) -> t.Tuple[float, float, float]:
    dataset_range = np.ptp(np_dataset)
    dataset_variance = np.var(np_dataset)
    dataset_standard_deviation = np.std(np_dataset)
    return dataset_range, dataset_variance, dataset_standard_deviation

for numerical_var in NUMERICAL_VARIABLES:
    dataset_for_numerical_val = DatasetManager.obj_list_to_flat_np_array(
        dataset_obj_list, [numerical_var]
    )
    dataset_range, dataset_variance, dataset_standard_deviation = (
        calculate_spread_numerical(dataset_for_numerical_val)
    )
    print(f"On numerical var {numerical_var}")
    print(f"Range: {dataset_range}")
    print(f"Variance: {dataset_variance}")
    print(f"Standard deviation: {dataset_standard_deviation}")
    print("-" * 10)
On numerical var Age
Range: 47.0
Variance: 40.252235412597656
Standard deviation: 6.3444647789001465
----------
On numerical var Height
Range: 0.5299999713897705
Variance: 0.008701666258275509
Standard deviation: 0.09328272193670273
----------
On numerical var Weight
Range: 134.0
Variance: 685.6525268554688
Standard deviation: 26.184967041015625
----------
On numerical var FCVC
Range: 2.0
Variance: 0.2849425971508026
Standard deviation: 0.5338001251220703
----------
On numerical var NCP
Range: 3.0
Variance: 0.6050573587417603
Standard deviation: 0.777854323387146
----------
On numerical var CH2O
Range: 2.0
Variance: 0.37553396821022034
Standard deviation: 0.6128082871437073
----------
On numerical var FAF
Range: 3.0
Variance: 0.7231647968292236
Standard deviation: 0.8503909707069397
----------
On numerical var TUE
Range: 2.0
Variance: 0.37061673402786255
Standard deviation: 0.6087830066680908
----------
3. Skewness, kurtosis¶
Calculates skewness and kurtosis of the dataset.
Meaning for skewness:
- Positively skewed (right-skewed):
- The distribution is positively skewed if its tail on the right side is longer or "fatter" than on the left side. This means that there are more data points on the left side, and the distribution has a longer right tail.
- Values:
> 1
- Negatively skewed (left-skewed):
- The distribution is negatively skewed if its tail on the left side is longer or "fatter" than on the right side. This means that there are more data points on the right side, and the distribution has a longer left tail.
- Values:
< -1
- Symmetric:
- If the distribution is roughly the same on both sides, it is symmetric, and the skewness is close to 0.
- Values:
~ 0
Meaning of kurtosis:
- Mesokurtic (normal distribution):
- A distribution with kurtosis similar to that of a normal distribution
- Values:
~ 0
- Leptokurtic:
- A distribution with positive kurtosis, indicating heavier tails and a more peaked central region compared to a normal distribution
- Values:
> 1
- Platykurtic:
- A distribution with negative kurtosis, indicating lighter tails and a flatter central region compared to a normal distribution
- Values:
< -1
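As a sanity check of these definitions, an exponential sample should come out clearly right-skewed and leptokurtic, while a uniform sample should be roughly symmetric and platykurtic (a sketch on synthetic data, not on the obesity dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
expo = rng.exponential(scale=1.0, size=100_000)  # theoretical skew 2, excess kurtosis 6
unif = rng.uniform(0, 1, size=100_000)           # theoretical skew 0, excess kurtosis -1.2

print(stats.skew(expo), stats.kurtosis(expo))  # right-skewed, leptokurtic
print(stats.skew(unif), stats.kurtosis(unif))  # symmetric, platykurtic
```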
def calculate_skewness_kurtosis_numerical(
    np_dataset: np.array,
) -> t.Tuple[float, float]:
    dataset_skewness = scpy_stats.skew(np_dataset)
    dataset_kurtosis = scpy_stats.kurtosis(np_dataset)
    return dataset_skewness, dataset_kurtosis

for numerical_var in NUMERICAL_VARIABLES:
    dataset_for_numerical_val = DatasetManager.obj_list_to_flat_np_array(
        dataset_obj_list, [numerical_var]
    )
    dataset_skewness, dataset_kurtosis = calculate_skewness_kurtosis_numerical(
        dataset_for_numerical_val
    )
    print(f"On numerical var {numerical_var}")
    print(f"Skewness: {dataset_skewness}")
    print(f"Kurtosis: {dataset_kurtosis}")
    print(f"{'-' * 10}\n")
On numerical var Age
Skewness: 1.5280142589581394
Kurtosis: 2.8168588574298665
----------
On numerical var Height
Skewness: -0.012848459164393682
Kurtosis: -0.5644577300848859
----------
On numerical var Weight
Skewness: 0.2552297058507633
Kurtosis: -0.7010823664264967
----------
On numerical var FCVC
Skewness: -0.4325967095068614
Kurtosis: -0.6388791385493944
----------
On numerical var NCP
Skewness: -1.1063109919189047
Kurtosis: 0.38177433951646744
----------
On numerical var CH2O
Skewness: -0.10483672396474479
Kurtosis: -0.8801543419699018
----------
On numerical var FAF
Skewness: 0.4981349859683153
Kurtosis: -0.6219600181728873
----------
On numerical var TUE
Skewness: 0.6180627956587121
Kurtosis: -0.5502024431375689
----------
4. Frequency of categorical data¶
Here we count how often each categorical value appears in a data series.
def calculate_frequency_of_data_categorical(dataset: t.List[str]) -> t.Dict[str, int]:
    counts = {}
    for entry in dataset:
        if entry in counts:
            counts[entry] += 1
        else:
            counts[entry] = 1
    return counts
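The counting loop above is equivalent to collections.Counter from the standard library, which could replace it:

```python
from collections import Counter

dataset = ["yes", "no", "yes", "yes"]
counts = Counter(dataset)
print(dict(counts))  # {'yes': 3, 'no': 1}
```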
for categorical_var in CATEGORICAL_VARIABLES:
    dataset_for_categorical_val = DatasetManager.obj_list_to_flat_list(
        dataset_obj_list, [categorical_var]
    )
    dataset_frequency = calculate_frequency_of_data_categorical(
        dataset_for_categorical_val
    )
    print(f"On categorical var {categorical_var}")
    for entry in dataset_frequency:
        print(f'Frequency of value "{entry}": {dataset_frequency[entry]}')
    print("-" * 10)
On categorical var FAVC
Frequency of value "no": 245
Frequency of value "yes": 1866
----------
On categorical var CAEC
Frequency of value "Sometimes": 1765
Frequency of value "Frequently": 242
Frequency of value "Always": 53
Frequency of value "no": 51
----------
On categorical var CALC
Frequency of value "no": 639
Frequency of value "Sometimes": 1401
Frequency of value "Frequently": 70
Frequency of value "Always": 1
----------
On categorical var SCC
Frequency of value "no": 2015
Frequency of value "yes": 96
----------
On categorical var MTRANS
Frequency of value "Public_Transportation": 1580
Frequency of value "Walking": 56
Frequency of value "Automobile": 457
Frequency of value "Motorbike": 11
Frequency of value "Bike": 7
----------
On categorical var Gender
Frequency of value "Female": 1043
Frequency of value "Male": 1068
----------
On categorical var family_history_with_overweight
Frequency of value "yes": 1726
Frequency of value "no": 385
----------
On categorical var SMOKE
Frequency of value "no": 2067
Frequency of value "yes": 44
----------
On categorical var NObeyesdad
Frequency of value "Normal_Weight": 287
Frequency of value "Overweight_Level_I": 290
Frequency of value "Overweight_Level_II": 290
Frequency of value "Obesity_Type_I": 351
Frequency of value "Insufficient_Weight": 272
Frequency of value "Obesity_Type_II": 297
Frequency of value "Obesity_Type_III": 324
----------
5. Graphs¶
Here we can find histograms, density charts and boxplots.
5.1. Histograms¶
Histograms show how frequently each value occurs in the dataset.
for numerical_var in NUMERICAL_VARIABLES:
    dataset_for_numerical_val = DatasetManager.obj_list_to_flat_np_array(
        dataset_obj_list, [numerical_var]
    )
    plt.figure()
    plt.hist(
        dataset_for_numerical_val,
        bins=30,
        color="lightblue",
        edgecolor="black",
        alpha=0.7,
        label=[numerical_var],
    )
    plt.title(f'Histogram for numerical variable "{numerical_var}"')
    plt.xlabel("Values")
    plt.ylabel("Count")
    plt.legend()
    plt.show()

for categorical_var in CATEGORICAL_VARIABLES:
    dataset_for_categorical_val = DatasetManager.obj_list_to_flat_list(
        dataset_obj_list, [categorical_var]
    )
    dataset_frequency = calculate_frequency_of_data_categorical(
        dataset_for_categorical_val
    )
    dataset_keys = [key for key in dataset_frequency]
    dataset_values = [dataset_frequency[key] for key in dataset_frequency]
    plt.figure()
    plt.bar(
        dataset_keys, dataset_values, color="lightblue", edgecolor="black", alpha=0.7
    )
    plt.xticks(range(len(dataset_keys)), dataset_keys, rotation=45, ha="right")
    plt.title(f'Histogram for categorical variable "{categorical_var}"')
    plt.show()
5.2. Density charts¶
Density charts show how frequently each value occurs in the dataset and estimate the distribution the data follows.
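Under the hood, a density chart is a kernel density estimate: the data is smoothed with a Gaussian kernel whose width is set by the bandwidth. A small sanity-check sketch on synthetic data (the bandwidth 0.1 mirrors the value used below; for a standard normal, the true density at 0 is 1/sqrt(2*pi) ≈ 0.399):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=5000)

kde = gaussian_kde(sample, bw_method=0.1)
print(kde(0.0)[0])  # should be close to 0.399
```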
for numerical_var in NUMERICAL_VARIABLES:
    dataset_for_numerical_val = DatasetManager.obj_list_to_flat_np_array(
        dataset_obj_list, [numerical_var]
    )
    plt.figure()
    # `bw=` is deprecated in recent seaborn; `bw_method` is the equivalent parameter
    sns.kdeplot(dataset_for_numerical_val, bw_method=0.1)
    plt.hist(
        dataset_for_numerical_val,
        bins=30,
        density=True,
        color="lightblue",
        edgecolor="black",
        alpha=0.7,
        label=[numerical_var],
    )
    plt.title(f'Density Chart for "{numerical_var}"')
    plt.xlabel("Values")
    plt.ylabel("Density")
    plt.legend()
    plt.show()
5.3. Boxplots¶
Boxplot charts show how the data "behaves":
- min
- max
- quantiles
- outliers
- median
- inter-quartile range (contains 50% of the data)
- Skewness of data
- Robustness to extreme values
- etc.
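The quantities a boxplot displays can also be computed directly; a sketch with numpy on a hand-made toy array, using the usual 1.5 * IQR outlier fences:

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # inter-quartile range: middle 50% of the data
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr)  # 2.25 3.0 3.75 1.5
print(outliers)             # [10.] is flagged as an outlier
```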
for numerical_var in NUMERICAL_VARIABLES:
    dataset_for_numerical_val = DatasetManager.obj_list_to_flat_np_array(
        dataset_obj_list, [numerical_var]
    )
    plt.figure()
    plt.boxplot(dataset_for_numerical_val, labels=[numerical_var])
    plt.xlabel("Group")
    plt.ylabel("Values")
    plt.title(f'Boxplot for "{numerical_var}"')
    plt.show()
Multivariate analysis¶
The following will be applied:
- Find Pearson correlation and Spearman (rank) correlation
- Chi-squared test, Fisher test and contingency tables
- T-test (test of means), Z-test, ANOVA test
- PCA in 2D and 3D; t-SNE in 2D and 3D; both using numerical + categorical attributes
- Projections using "projection pursuit" methodologies
- Stacked histograms
- Corrgrams
numeric_numeric_correlation_indexes: dict[tuple[str, str], float] = dict()
categorial_categorial_correlation_indexes: dict[tuple[str, str], float] = dict()
categorial_numeric_t_correlation_indexes: dict[tuple[str, str], float] = dict()
categorial_numeric_anova_correlation_indexes: dict[tuple[str, str], float] = dict()
1. Find Pearson Correlation and Spearman (Rank) correlation¶
pearson_correlation: list[tuple[str, str, float]] = []
rank_correlation: list[tuple[str, str, float]] = []
for it in range(len(NUMERICAL_VARIABLES) - 1):
    for jt in range(it + 1, len(NUMERICAL_VARIABLES)):
        lh_var, rh_var = NUMERICAL_VARIABLES[it], NUMERICAL_VARIABLES[jt]
        lh_dataset = DatasetManager.obj_list_to_np_array_numeric(
            dataset_obj_list, [lh_var]
        ).reshape(-1)
        rh_dataset = DatasetManager.obj_list_to_np_array_numeric(
            dataset_obj_list, [rh_var]
        ).reshape(-1)
        pearson_cr = scpy_stats.pearsonr(lh_dataset, rh_dataset).correlation
        rank_cr = scpy_stats.spearmanr(lh_dataset, rh_dataset).correlation
        numeric_numeric_correlation_indexes[(lh_var, rh_var)] = pearson_cr
        pearson_correlation.append((lh_var, rh_var, pearson_cr))
        rank_correlation.append((lh_var, rh_var, rank_cr))

pearson_correlation = list(sorted(pearson_correlation, key=lambda x: x[2]))
print("Pearson Correlation\n\n")
for item in pearson_correlation:
    print(f"{item}\n")
Pearson Correlation
('Age', 'TUE', -0.29693059206832395)
('Age', 'FAF', -0.14493832661742817)
('FCVC', 'TUE', -0.10113484623419644)
('Weight', 'TUE', -0.07156135896003547)
('Weight', 'FAF', -0.05143626950415489)
('Age', 'CH2O', -0.04530385780197365)
('Age', 'NCP', -0.04394372656120907)
('Height', 'FCVC', -0.0381210583958552)
('Age', 'Height', -0.0259581343195753)
('CH2O', 'TUE', 0.01196533841818707)
('Age', 'FCVC', 0.016290886053423527)
('FCVC', 'FAF', 0.019939398348716313)
('NCP', 'TUE', 0.03632557228485052)
('FCVC', 'NCP', 0.04221629598271067)
('Height', 'TUE', 0.051911666347923456)
('NCP', 'CH2O', 0.05708799585311857)
('FAF', 'TUE', 0.05856206584303293)
('FCVC', 'CH2O', 0.06846147191156764)
('Weight', 'NCP', 0.1074689879684435)
('NCP', 'FAF', 0.1295043068383895)
('CH2O', 'FAF', 0.1672364921916402)
('Weight', 'CH2O', 0.20057538691633658)
('Age', 'Weight', 0.20256010359865867)
('Height', 'CH2O', 0.21337591711036222)
('Weight', 'FCVC', 0.21612470500918698)
('Height', 'NCP', 0.24367172595790443)
('Height', 'FAF', 0.2947089984659658)
('Height', 'Weight', 0.4631361166156267)
rank_correlation = list(sorted(rank_correlation, key=lambda x: x[2]))
print("Rank Correlation\n\n")
for item in rank_correlation:
    print(f"{item}\n")
Rank Correlation
('Age', 'TUE', -0.29807629070494945)
('Age', 'FAF', -0.20831375458250287)
('Age', 'NCP', -0.1056676147687238)
('FCVC', 'TUE', -0.08751427099494341)
('Height', 'FCVC', -0.0560789812639322)
('Weight', 'TUE', -0.0498696877020995)
('Weight', 'FAF', -0.04387104884312014)
('Age', 'Height', -0.0029564853729459706)
('Weight', 'NCP', 0.0028753532017528434)
('Age', 'CH2O', 0.01306394003222349)
('CH2O', 'TUE', 0.023161743695596683)
('FCVC', 'FAF', 0.02768765071431842)
('FAF', 'TUE', 0.05062900197946862)
('Age', 'FCVC', 0.06159363184028231)
('FCVC', 'CH2O', 0.06569298366637302)
('NCP', 'CH2O', 0.07022705632122264)
('Height', 'TUE', 0.08154809622237319)
('FCVC', 'NCP', 0.08618653258012095)
('NCP', 'TUE', 0.08726794187829554)
('NCP', 'FAF', 0.14491208415223555)
('CH2O', 'FAF', 0.15572041774565834)
('Height', 'NCP', 0.20378695212085873)
('Weight', 'FCVC', 0.20841708075953239)
('Height', 'CH2O', 0.2252371954248203)
('Weight', 'CH2O', 0.2255923594627013)
('Height', 'FAF', 0.32587044461242604)
('Age', 'Weight', 0.35677053757107424)
('Height', 'Weight', 0.4625481443436106)
best_numerical_correlations = pearson_correlation[:3] + pearson_correlation[-6:]
print(best_numerical_correlations)
[('Age', 'TUE', -0.29693059206832395), ('Age', 'FAF', -0.14493832661742817), ('FCVC', 'TUE', -0.10113484623419644), ('Age', 'Weight', 0.20256010359865867), ('Height', 'CH2O', 0.21337591711036222), ('Weight', 'FCVC', 0.21612470500918698), ('Height', 'NCP', 0.24367172595790443), ('Height', 'FAF', 0.2947089984659658), ('Height', 'Weight', 0.4631361166156267)]
2. Chi-squared test, Fisher test and contingency tables¶
alpha = 0.05
chi2_test: list[tuple[str, str, float]] = []
fisher_test: list[tuple[str, str, float]] = []
for it in range(len(CATEGORICAL_VARIABLES) - 1):
    for jt in range(it + 1, len(CATEGORICAL_VARIABLES)):
        lh_var, rh_var = CATEGORICAL_VARIABLES[it], CATEGORICAL_VARIABLES[jt]
        dataset = DatasetManager.object_list_to_pd_dataframe_contingency_table(
            dataset_obj_list, lh_var, rh_var
        ).to_numpy()
        stat, p, degrees_of_freedom, expected_contingency = scpy_stats.chi2_contingency(dataset)
        categorial_categorial_correlation_indexes[(lh_var, rh_var)] = stat
        if p <= alpha:
            print(f"{(lh_var, rh_var)} DEPENDENT CHI2={stat}")
            chi2_test.append((lh_var, rh_var, stat))
        else:
            print(f"{(lh_var, rh_var)} INDEPENDENT CHI2={stat}")
        if dataset.shape == (2, 2):
            odds, p_value = scpy_stats.fisher_exact(dataset)
            if p_value <= alpha:
                fisher_test.append((lh_var, rh_var, odds))
('FAVC', 'CAEC') DEPENDENT CHI2=81.74211774864109
('FAVC', 'CALC') DEPENDENT CHI2=42.39419448970789
('FAVC', 'SCC') DEPENDENT CHI2=73.90562515374857
('FAVC', 'MTRANS') DEPENDENT CHI2=88.89314393480664
('FAVC', 'Gender') DEPENDENT CHI2=8.499937649441428
('FAVC', 'family_history_with_overweight') DEPENDENT CHI2=89.68723603559711
('FAVC', 'SMOKE') DEPENDENT CHI2=4.36715259929876
('FAVC', 'NObeyesdad') DEPENDENT CHI2=233.34130356133423
('CAEC', 'CALC') DEPENDENT CHI2=69.30932044844948
('CAEC', 'SCC') DEPENDENT CHI2=56.836277542980895
('CAEC', 'MTRANS') DEPENDENT CHI2=69.28370301459474
('CAEC', 'Gender') DEPENDENT CHI2=39.08648859052029
('CAEC', 'family_history_with_overweight') DEPENDENT CHI2=260.36443035979585
('CAEC', 'SMOKE') INDEPENDENT CHI2=7.423523262779556
('CAEC', 'NObeyesdad') DEPENDENT CHI2=802.9772817566468
('CALC', 'SCC') DEPENDENT CHI2=9.490867810165144
('CALC', 'MTRANS') DEPENDENT CHI2=69.30657070695034
('CALC', 'Gender') INDEPENDENT CHI2=5.323365912789968
('CALC', 'family_history_with_overweight') INDEPENDENT CHI2=3.305900315106015
('CALC', 'SMOKE') DEPENDENT CHI2=25.732854971942185
('CALC', 'NObeyesdad') DEPENDENT CHI2=338.5775202939282
('SCC', 'MTRANS') DEPENDENT CHI2=14.26633297581688
('SCC', 'Gender') DEPENDENT CHI2=21.262119551377683
('SCC', 'family_history_with_overweight') DEPENDENT CHI2=70.2923361354359
('SCC', 'SMOKE') INDEPENDENT CHI2=3.3394585049011867
('SCC', 'NObeyesdad') DEPENDENT CHI2=123.02389868912441
('MTRANS', 'Gender') DEPENDENT CHI2=59.226776053203565
('MTRANS', 'family_history_with_overweight') DEPENDENT CHI2=33.37341172586305
('MTRANS', 'SMOKE') INDEPENDENT CHI2=3.9016208578111136
('MTRANS', 'NObeyesdad') DEPENDENT CHI2=292.59394813167995
('Gender', 'family_history_with_overweight') DEPENDENT CHI2=21.656148159794412
('Gender', 'SMOKE') INDEPENDENT CHI2=3.6150140661838517
('Gender', 'NObeyesdad') DEPENDENT CHI2=657.746227342968
('family_history_with_overweight', 'SMOKE') INDEPENDENT CHI2=0.3618266791472247
('family_history_with_overweight', 'NObeyesdad') DEPENDENT CHI2=621.9794353945297
('SMOKE', 'NObeyesdad') DEPENDENT CHI2=32.13783205600177
chi2_test = list(sorted(chi2_test, key=lambda x: x[2]))
print("CHI2 Test\n\n")
for item in chi2_test:
    print(f"{item}\n")
CHI2 Test
('FAVC', 'SMOKE', 4.36715259929876)
('FAVC', 'Gender', 8.499937649441428)
('CALC', 'SCC', 9.490867810165144)
('SCC', 'MTRANS', 14.26633297581688)
('SCC', 'Gender', 21.262119551377683)
('Gender', 'family_history_with_overweight', 21.656148159794412)
('CALC', 'SMOKE', 25.732854971942185)
('SMOKE', 'NObeyesdad', 32.13783205600177)
('MTRANS', 'family_history_with_overweight', 33.37341172586305)
('CAEC', 'Gender', 39.08648859052029)
('FAVC', 'CALC', 42.39419448970789)
('CAEC', 'SCC', 56.836277542980895)
('MTRANS', 'Gender', 59.226776053203565)
('CAEC', 'MTRANS', 69.28370301459474)
('CALC', 'MTRANS', 69.30657070695034)
('CAEC', 'CALC', 69.30932044844948)
('SCC', 'family_history_with_overweight', 70.2923361354359)
('FAVC', 'SCC', 73.90562515374857)
('FAVC', 'CAEC', 81.74211774864109)
('FAVC', 'MTRANS', 88.89314393480664)
('FAVC', 'family_history_with_overweight', 89.68723603559711)
('SCC', 'NObeyesdad', 123.02389868912441)
('FAVC', 'NObeyesdad', 233.34130356133423)
('CAEC', 'family_history_with_overweight', 260.36443035979585)
('MTRANS', 'NObeyesdad', 292.59394813167995)
('CALC', 'NObeyesdad', 338.5775202939282)
('family_history_with_overweight', 'NObeyesdad', 621.9794353945297)
('Gender', 'NObeyesdad', 657.746227342968)
('CAEC', 'NObeyesdad', 802.9772817566468)
# Drop the 10 strongest pairs (mostly pairs involving the NObeyesdad label) and
# print the contingency tables of the rest, strongest remaining pair first.
best_chi2_test = chi2_test[:-10]
for lh_var, rh_var, chi2_val in best_chi2_test[::-1]:
    print(DatasetManager.object_list_to_pd_dataframe_contingency_table(dataset_obj_list, lh_var, rh_var))
    print("\n" + "-" * 10 + "\n")
CAEC  Always  Frequently  Sometimes  no
FAVC
no        12          67        157   9
yes       41         175       1608  42
----------
SCC     no  yes
FAVC
no     207   38
yes   1808   58
----------
family_history_with_overweight   no   yes
SCC
no                              336  1679
yes                              49    47
----------
CALC        Always  Frequently  Sometimes   no
CAEC
Always           0           7         28   18
Frequently       1          16        120  105
Sometimes        0          45       1211  509
no               0           2         42    7
----------
MTRANS      Automobile  Bike  Motorbike  Public_Transportation  Walking
CALC
Always               0     0          0                      0        1
Frequently          29     0          0                     38        3
Sometimes          271     4          6                   1091       29
no                 157     3          5                    451       23
----------
MTRANS      Automobile  Bike  Motorbike  Public_Transportation  Walking
CAEC
Always              12     1          1                     33        6
Frequently          25     0          5                    201       11
Sometimes          417     6          5                   1300       37
no                   3     0          0                     46        2
----------
Gender                 Female  Male
MTRANS
Automobile                166   291
Bike                        0     7
Motorbike                   2     9
Public_Transportation     854   726
Walking                    21    35
----------
SCC           no  yes
CAEC
Always        45    8
Frequently   215   27
Sometimes   1711   54
no            44    7
----------
CALC  Always  Frequently  Sometimes   no
FAVC
no         0          15        118  112
yes        1          55       1283  527
----------
Gender      Female  Male
CAEC
Always          23    30
Frequently     161    81
Sometimes      844   921
no              15    36
----------
family_history_with_overweight   no   yes
MTRANS
Automobile                       50   407
Bike                              2     5
Motorbike                         5     6
Public_Transportation           309  1271
Walking                          19    37
----------
NObeyesdad  Insufficient_Weight  Normal_Weight  Obesity_Type_I  Obesity_Type_II  Obesity_Type_III  Overweight_Level_I  Overweight_Level_II
SMOKE
no                          271            274             345              282               323                 287                  285
yes                           1             13               6               15                 1                   3                    5
----------
SMOKE         no  yes
CALC
Always         1    0
Frequently    63    7
Sometimes   1370   31
no           633    6
----------
family_history_with_overweight   no  yes
Gender
Female                          232  811
Male                            153  915
----------
Gender  Female  Male
SCC
no         973  1042
yes         70    26
----------
MTRANS  Automobile  Bike  Motorbike  Public_Transportation  Walking
SCC
no             444     6          9                   1506       50
yes             13     1          2                     74        6
----------
SCC         no  yes
CALC
Always       1    0
Frequently  62    8
Sometimes 1346   55
no         606   33
----------
Gender  Female  Male
FAVC
no         143   102
yes        900   966
----------
SMOKE     no  yes
FAVC
no       235   10
yes     1832   34
----------
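The contingency tables above are produced by a project-specific helper; with standard tooling the same table and chi-squared statistic can be reproduced with `pandas.crosstab` and `scipy.stats.chi2_contingency`. A minimal sketch on toy data (the column values below are hypothetical, not from the dataset):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy survey data standing in for two categorical dataset columns.
df = pd.DataFrame({
    "FAVC":  ["yes", "yes", "no", "yes", "no", "yes", "yes", "no"],
    "SMOKE": ["no",  "no",  "no", "yes", "no", "no",  "yes", "no"],
})

# pandas.crosstab builds the same kind of contingency table printed above.
table = pd.crosstab(df["FAVC"], df["SMOKE"])
print(table)

# chi2_contingency returns the statistic, the p-value, the degrees of freedom,
# and the expected counts under the independence hypothesis.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```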
fisher_test = sorted(fisher_test, key=lambda x: x[2])
print("Fisher Test\n")
for item in fisher_test:
    print(f"{item}\n")
Fisher Test
('FAVC', 'SCC', 0.17474965067536097)
('SCC', 'family_history_with_overweight', 0.1919509912362801)
('SCC', 'Gender', 0.34683301343570055)
('FAVC', 'SMOKE', 0.4361353711790393)
('FAVC', 'Gender', 1.5047712418300654)
('Gender', 'family_history_with_overweight', 1.7107903580667778)
('Gender', 'SMOKE', 1.912864934231633)
('SCC', 'SMOKE', 2.7838827838827838)
('FAVC', 'family_history_with_overweight', 3.7460484720758696)
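The Fisher values above are the sample odds ratios of the corresponding 2×2 tables (e.g. for FAVC/SCC: (207·58)/(38·1808) ≈ 0.175). `scipy.stats.fisher_exact` returns exactly this quantity together with the exact p-value; a minimal sketch on a hypothetical 2×2 table:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency table (e.g. rows: FAVC no/yes, cols: SCC no/yes).
table = [[8, 2],
         [1, 5]]

# fisher_exact returns the sample odds ratio and the exact two-sided p-value;
# it is preferred over the chi-squared test when cell counts are small.
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)  # odds ratio = (8*5)/(2*1) = 20.0
```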
3. T-Test (test of means), Z-Test, ANOVA Test¶
alpha = 0.05
t_test: list[tuple[str, str, float]] = []
anova_test: list[tuple[str, str, float]] = []
for it in range(len(NUMERICAL_VARIABLES)):
    for jt in range(len(CATEGORICAL_VARIABLES)):
        lh_var, rh_var = NUMERICAL_VARIABLES[it], CATEGORICAL_VARIABLES[jt]
        numerical_for_each_category = DatasetManager.get_numerical_for_each_category(dataset_obj_list, lh_var, rh_var)
        if len(numerical_for_each_category) == 2:
            # Exactly two groups: compare means with an independent two-sample t-test.
            lh_numeric_data, rh_numeric_data = numerical_for_each_category[0][0], numerical_for_each_category[1][0]
            statistic, p_value = scpy_stats.ttest_ind(lh_numeric_data, rh_numeric_data)
            categorial_numeric_t_correlation_indexes[(lh_var, rh_var)] = statistic
            if p_value <= alpha:
                print(f"Found {(lh_var, rh_var)} TO BE correlated with T Test={statistic}")
                t_test.append((lh_var, rh_var, statistic))
            else:
                print(f"Found {(lh_var, rh_var)} NOT TO BE correlated with T Test={statistic}")
        else:
            # More than two groups: use one-way ANOVA instead.
            statistic, p_value = scpy_stats.f_oneway(*[part[0] for part in numerical_for_each_category])
            categorial_numeric_anova_correlation_indexes[(lh_var, rh_var)] = statistic
            if p_value <= alpha:
                print(f"Found {(lh_var, rh_var)} TO BE correlated with ANOVA Test={statistic}")
                anova_test.append((lh_var, rh_var, statistic))
            else:
                print(f"Found {(lh_var, rh_var)} NOT TO BE correlated with ANOVA Test={statistic}")
Found ('Age', 'FAVC') TO BE correlated with T Test=-2.940621482248463
Found ('Age', 'CAEC') TO BE correlated with ANOVA Test=15.28169961642531
Found ('Age', 'CALC') TO BE correlated with ANOVA Test=4.964661133280788
Found ('Age', 'SCC') TO BE correlated with T Test=5.376630323046368
Found ('Age', 'MTRANS') TO BE correlated with ANOVA Test=306.70914436153976
Found ('Age', 'Gender') TO BE correlated with T Test=-2.225054898548922
Found ('Age', 'family_history_with_overweight') TO BE correlated with T Test=-9.654204585521246
Found ('Age', 'SMOKE') TO BE correlated with T Test=-4.2424047324934
Found ('Age', 'NObeyesdad') TO BE correlated with ANOVA Test=77.95415423043549
Found ('Height', 'FAVC') TO BE correlated with T Test=-8.324640226409423
Found ('Height', 'CAEC') TO BE correlated with ANOVA Test=18.417918877152303
Found ('Height', 'CALC') TO BE correlated with ANOVA Test=12.330063092417138
Found ('Height', 'SCC') TO BE correlated with T Test=6.198134379189559
Found ('Height', 'MTRANS') TO BE correlated with ANOVA Test=4.807337594541384
Found ('Height', 'Gender') TO BE correlated with T Test=-36.1439858729631
Found ('Height', 'family_history_with_overweight') TO BE correlated with T Test=-11.740418846491824
Found ('Height', 'SMOKE') TO BE correlated with T Test=-2.5526797754313053
Found ('Height', 'NObeyesdad') TO BE correlated with ANOVA Test=38.43231255660025
Found ('Weight', 'FAVC') TO BE correlated with T Test=-12.996183205856667
Found ('Weight', 'CAEC') TO BE correlated with ANOVA Test=149.9055766484753
Found ('Weight', 'CALC') TO BE correlated with ANOVA Test=51.402348504720244
Found ('Weight', 'SCC') TO BE correlated with T Test=9.467297014889331
Found ('Weight', 'MTRANS') TO BE correlated with ANOVA Test=6.814859291913126
Found ('Weight', 'Gender') TO BE correlated with T Test=-7.523365268812722
Found ('Weight', 'family_history_with_overweight') TO BE correlated with T Test=-26.290044638238182
Found ('Weight', 'SMOKE') NOT TO BE correlated with T Test=-1.1827664792692334
Found ('Weight', 'NObeyesdad') TO BE correlated with ANOVA Test=1966.5180176274885
Found ('FCVC', 'FAVC') NOT TO BE correlated with T Test=1.2534106383979502
Found ('FCVC', 'CAEC') TO BE correlated with ANOVA Test=9.198651503582724
Found ('FCVC', 'CALC') TO BE correlated with ANOVA Test=5.005483348478049
Found ('FCVC', 'SCC') TO BE correlated with T Test=-3.308280209582652
Found ('FCVC', 'MTRANS') TO BE correlated with ANOVA Test=2.592401648945186
Found ('FCVC', 'Gender') TO BE correlated with T Test=13.109924553316425
Found ('FCVC', 'family_history_with_overweight') NOT TO BE correlated with T Test=-1.8555618180030429
Found ('FCVC', 'SMOKE') NOT TO BE correlated with T Test=-0.6576753249593008
Found ('FCVC', 'NObeyesdad') TO BE correlated with ANOVA Test=112.31546186980398
Found ('NCP', 'FAVC') NOT TO BE correlated with T Test=0.3214721917425319
Found ('NCP', 'CAEC') TO BE correlated with ANOVA Test=16.965387860653784
Found ('NCP', 'CALC') TO BE correlated with ANOVA Test=8.42073855849327
Found ('NCP', 'SCC') NOT TO BE correlated with T Test=0.7175997482784392
Found ('NCP', 'MTRANS') NOT TO BE correlated with ANOVA Test=1.9270968884939499
Found ('NCP', 'Gender') TO BE correlated with T Test=-3.1115694247906007
Found ('NCP', 'family_history_with_overweight') TO BE correlated with T Test=-3.285950657193645
Found ('NCP', 'SMOKE') NOT TO BE correlated with T Test=-0.3587309610175394
Found ('NCP', 'NObeyesdad') TO BE correlated with ANOVA Test=26.81166184274833
Found ('CH2O', 'FAVC') NOT TO BE correlated with T Test=-0.44636099700339693
Found ('CH2O', 'CAEC') TO BE correlated with ANOVA Test=31.043847256744105
Found ('CH2O', 'CALC') TO BE correlated with ANOVA Test=6.022580003189467
Found ('CH2O', 'SCC') NOT TO BE correlated with T Test=-0.3690782687224163
Found ('CH2O', 'MTRANS') NOT TO BE correlated with ANOVA Test=1.466613000541393
Found ('CH2O', 'Gender') TO BE correlated with T Test=-4.985669884330864
Found ('CH2O', 'family_history_with_overweight') TO BE correlated with T Test=-6.845669428081791
Found ('CH2O', 'SMOKE') NOT TO BE correlated with T Test=1.470064610772763
Found ('CH2O', 'NObeyesdad') TO BE correlated with ANOVA Test=16.17114219437877
Found ('FAF', 'FAVC') TO BE correlated with T Test=4.988730482568133
Found ('FAF', 'CAEC') NOT TO BE correlated with ANOVA Test=1.736025503822377
Found ('FAF', 'CALC') TO BE correlated with ANOVA Test=13.567463037112436
Found ('FAF', 'SCC') TO BE correlated with T Test=-3.4179258182333108
Found ('FAF', 'MTRANS') TO BE correlated with ANOVA Test=9.056101009958532
Found ('FAF', 'Gender') TO BE correlated with T Test=-8.868352739102095
Found ('FAF', 'family_history_with_overweight') TO BE correlated with T Test=2.6068411680359014
Found ('FAF', 'SMOKE') NOT TO BE correlated with T Test=-0.515115629604034
Found ('FAF', 'NObeyesdad') TO BE correlated with ANOVA Test=17.4842004293805
Found ('TUE', 'FAVC') TO BE correlated with T Test=-3.149347488084263
Found ('TUE', 'CAEC') TO BE correlated with ANOVA Test=9.007932346212005
Found ('TUE', 'CALC') TO BE correlated with ANOVA Test=9.893124246844554
Found ('TUE', 'SCC') NOT TO BE correlated with T Test=0.5018847683978304
Found ('TUE', 'MTRANS') TO BE correlated with ANOVA Test=20.105164914771215
Found ('TUE', 'Gender') NOT TO BE correlated with T Test=-0.7931989920678144
Found ('TUE', 'family_history_with_overweight') NOT TO BE correlated with T Test=-1.0539219343714734
Found ('TUE', 'SMOKE') NOT TO BE correlated with T Test=-0.8089884209920915
Found ('TUE', 'NObeyesdad') TO BE correlated with ANOVA Test=7.876655737080669
t_test = sorted(t_test, key=lambda x: x[2])
print("Test of means (T-Test)\n")
for item in t_test:
    print(f"{item}\n")
Test of means (T-Test)
('Height', 'Gender', -36.1439858729631)
('Weight', 'family_history_with_overweight', -26.290044638238182)
('Weight', 'FAVC', -12.996183205856667)
('Height', 'family_history_with_overweight', -11.740418846491824)
('Age', 'family_history_with_overweight', -9.654204585521246)
('FAF', 'Gender', -8.868352739102095)
('Height', 'FAVC', -8.324640226409423)
('Weight', 'Gender', -7.523365268812722)
('CH2O', 'family_history_with_overweight', -6.845669428081791)
('CH2O', 'Gender', -4.985669884330864)
('Age', 'SMOKE', -4.2424047324934)
('FAF', 'SCC', -3.4179258182333108)
('FCVC', 'SCC', -3.308280209582652)
('NCP', 'family_history_with_overweight', -3.285950657193645)
('TUE', 'FAVC', -3.149347488084263)
('NCP', 'Gender', -3.1115694247906007)
('Age', 'FAVC', -2.940621482248463)
('Height', 'SMOKE', -2.5526797754313053)
('Age', 'Gender', -2.225054898548922)
('FAF', 'family_history_with_overweight', 2.6068411680359014)
('FAF', 'FAVC', 4.988730482568133)
('Age', 'SCC', 5.376630323046368)
('Height', 'SCC', 6.198134379189559)
('Weight', 'SCC', 9.467297014889331)
('FCVC', 'Gender', 13.109924553316425)
anova_test = sorted(anova_test, key=lambda x: x[2])
print("ANOVA Test\n")
for item in anova_test:
    print(f"{item}\n")
ANOVA Test
('FCVC', 'MTRANS', 2.592401648945186)
('Height', 'MTRANS', 4.807337594541384)
('Age', 'CALC', 4.964661133280788)
('FCVC', 'CALC', 5.005483348478049)
('CH2O', 'CALC', 6.022580003189467)
('Weight', 'MTRANS', 6.814859291913126)
('TUE', 'NObeyesdad', 7.876655737080669)
('NCP', 'CALC', 8.42073855849327)
('TUE', 'CAEC', 9.007932346212005)
('FAF', 'MTRANS', 9.056101009958532)
('FCVC', 'CAEC', 9.198651503582724)
('TUE', 'CALC', 9.893124246844554)
('Height', 'CALC', 12.330063092417138)
('FAF', 'CALC', 13.567463037112436)
('Age', 'CAEC', 15.28169961642531)
('CH2O', 'NObeyesdad', 16.17114219437877)
('NCP', 'CAEC', 16.965387860653784)
('FAF', 'NObeyesdad', 17.4842004293805)
('Height', 'CAEC', 18.417918877152303)
('TUE', 'MTRANS', 20.105164914771215)
('NCP', 'NObeyesdad', 26.81166184274833)
('CH2O', 'CAEC', 31.043847256744105)
('Height', 'NObeyesdad', 38.43231255660025)
('Weight', 'CALC', 51.402348504720244)
('Age', 'NObeyesdad', 77.95415423043549)
('FCVC', 'NObeyesdad', 112.31546186980398)
('Weight', 'CAEC', 149.9055766484753)
('Age', 'MTRANS', 306.70914436153976)
('Weight', 'NObeyesdad', 1966.5180176274885)
# Keep the strongest t statistics (8 most negative + 4 most positive) and the 14 highest ANOVA scores.
best_t_test = t_test[:8] + t_test[-4:]
best_anova_test = anova_test[-14:]
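The loop above applies a simple rule: a two-sample t-test when the categorical variable has exactly two levels, and one-way ANOVA otherwise. The same rule can be sketched as a self-contained function (`compare_groups` is a hypothetical name, not a helper from this project):

```python
from scipy import stats

def compare_groups(groups):
    """Two groups -> independent t-test; more -> one-way ANOVA (mirrors the loop above)."""
    if len(groups) == 2:
        return "t", *stats.ttest_ind(groups[0], groups[1])
    return "anova", *stats.f_oneway(*groups)

# Synthetic groups with clearly separated means.
two = [[1.0, 1.1, 0.9, 1.2], [2.0, 2.1, 1.9, 2.2]]
three = two + [[3.0, 3.1, 2.9, 3.2]]

kind2, stat2, p2 = compare_groups(two)
kind3, stat3, p3 = compare_groups(three)
print(kind2, p2)  # tiny p-value: the group means clearly differ
print(kind3, p3)
```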
4. PCA in 2D and 3D; t-SNE in 2D and 3D; both using numerical + categorical features¶
numeric_dataset = DatasetManager.obj_list_to_np_array_numeric(dataset_obj_list, NUMERICAL_VARIABLES)
numeric_dataset_scaled = StandardScaler().fit_transform(numeric_dataset)
categorical_to_binary_dataset = DatasetManager.obj_list_to_np_array_category_binary(dataset_obj_list, CATEGORICAL_VARIABLES_NO_LABEL)
numerical_categorical_dataset = np.hstack((numeric_dataset, categorical_to_binary_dataset))
numerical_categorical_dataset_scaled = StandardScaler().fit_transform(numerical_categorical_dataset)
label_dataset_np_str = DatasetManager.obj_list_to_np_array(dataset_obj_list, [LABEL_VARIABLE])
factorize_result = pd.factorize(pd.DataFrame(label_dataset_np_str)[0])
label_dataset, label_indexes = factorize_result[0], factorize_result[1].array.to_numpy().tolist()
numerical_pca_transformed_2d = PCA(n_components=2).fit_transform(numeric_dataset)
numerical_pca_transformed_2d_scaled = PCA(n_components=2).fit_transform(numeric_dataset_scaled)
numerical_pca_transformed_3d = PCA(n_components=3).fit_transform(numeric_dataset)
numerical_pca_transformed_3d_scaled = PCA(n_components=3).fit_transform(numeric_dataset_scaled)
numerical_categorial_pca_transformed_2d = PCA(n_components=2).fit_transform(numerical_categorical_dataset)
numerical_categorial_pca_transformed_2d_scaled = PCA(n_components=2).fit_transform(numerical_categorical_dataset_scaled)
numerical_categorial_pca_transformed_3d = PCA(n_components=3).fit_transform(numerical_categorical_dataset)
numerical_categorial_pca_transformed_3d_scaled = PCA(n_components=3).fit_transform(numerical_categorical_dataset_scaled)
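Both scaled and unscaled variants are computed because the dataset's variables live on very different scales (e.g. Weight in kilograms vs Height in meters), and unscaled PCA lets the largest-variance column dominate. A synthetic sketch of that effect, using `explained_variance_ratio_` (the data below is made up, not from the dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# One high-variance column (Weight-like scale) and two unit-scale columns.
X = np.column_stack([rng.normal(0, 100, 200),
                     rng.normal(0, 1, 200),
                     rng.normal(0, 1, 200)])

pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Without scaling, the first component is almost entirely the large-scale column.
print(pca_raw.explained_variance_ratio_)
print(pca_scaled.explained_variance_ratio_)
```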
colors = ['red', 'green', 'blue', 'purple', 'orange', 'yellow', 'gray']
assert len(label_indexes) == len(colors)
def plot_2d(title: str, data_to_plot: np.ndarray, labels_for_plot: np.ndarray):
    plt.scatter(data_to_plot[:, 0], data_to_plot[:, 1], c=labels_for_plot, cmap=matplotlib.colors.ListedColormap(colors))
    cb = plt.colorbar()
    loc = np.arange(0, len(colors), 1)
    cb.set_ticks(loc)
    cb.set_ticklabels(label_indexes)
    plt.title(title)
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.tight_layout()
    plt.show()
    plt.clf()
plot_2d("PCA 2D - Numeric Scaled", numerical_pca_transformed_2d_scaled, label_dataset)
plot_2d("PCA 2D - Numeric + Categorical Scaled", numerical_categorial_pca_transformed_2d_scaled, label_dataset)
data_to_plot_pd = pd.DataFrame(numerical_pca_transformed_3d_scaled, columns=["Component 1", "Component 2", "Component 3"])
data_to_plot_pd = pd.concat([data_to_plot_pd, pd.DataFrame(label_dataset_np_str, columns=["Label"])], axis=1, join="inner")
px.scatter_3d(data_to_plot_pd, x="Component 1", y="Component 2", z="Component 3", color="Label", title="PCA 3D - Numeric Scaled")
data_to_plot_pd = pd.DataFrame(numerical_categorial_pca_transformed_3d_scaled, columns=["Component 1", "Component 2", "Component 3"])
data_to_plot_pd = pd.concat([data_to_plot_pd, pd.DataFrame(label_dataset_np_str, columns=["Label"])], axis=1, join="inner")
px.scatter_3d(data_to_plot_pd, x="Component 1", y="Component 2", z="Component 3", color="Label", title="PCA 3D - Numeric + Categorical Scaled")
numerical_tsne_transformed_2d = TSNE(n_components=2).fit_transform(numeric_dataset)
numerical_tsne_transformed_2d_scaled = TSNE(n_components=2).fit_transform(numeric_dataset_scaled)
numerical_tsne_transformed_3d = TSNE(n_components=3).fit_transform(numeric_dataset)
numerical_tsne_transformed_3d_scaled = TSNE(n_components=3).fit_transform(numeric_dataset_scaled)
numerical_categorial_tsne_transformed_2d = TSNE(n_components=2).fit_transform(numerical_categorical_dataset)
numerical_categorial_tsne_transformed_2d_scaled = TSNE(n_components=2).fit_transform(numerical_categorical_dataset_scaled)
numerical_categorial_tsne_transformed_3d = TSNE(n_components=3).fit_transform(numerical_categorical_dataset)
numerical_categorial_tsne_transformed_3d_scaled = TSNE(n_components=3).fit_transform(numerical_categorical_dataset_scaled)
numerical_tsne_transformed_2d_scaled_per8 = TSNE(n_components=2, perplexity=8).fit_transform(numeric_dataset_scaled)
numerical_categorial_tsne_transformed_2d_scaled_per8 = TSNE(n_components=2, perplexity=8).fit_transform(numerical_categorical_dataset_scaled)
numerical_tsne_transformed_2d_scaled_per50 = TSNE(n_components=2, perplexity=50).fit_transform(numeric_dataset_scaled)
numerical_categorial_tsne_transformed_2d_scaled_per50 = TSNE(n_components=2, perplexity=50).fit_transform(numerical_categorical_dataset_scaled)
plot_2d("TSNE 2D - Numeric Scaled", numerical_tsne_transformed_2d_scaled, label_dataset)
plot_2d("TSNE 2D - Numeric Scaled Perplexity=8", numerical_tsne_transformed_2d_scaled_per8, label_dataset)
plot_2d("TSNE 2D - Numeric Scaled Perplexity=50", numerical_tsne_transformed_2d_scaled_per50, label_dataset)
plot_2d("TSNE 2D - Numeric + Categorical Scaled", numerical_categorial_tsne_transformed_2d_scaled, label_dataset)
plot_2d("TSNE 2D - Numeric + Categorical Scaled Perplexity=8", numerical_categorial_tsne_transformed_2d_scaled_per8, label_dataset)
plot_2d("TSNE 2D - Numeric + Categorical Scaled Perplexity=50", numerical_categorial_tsne_transformed_2d_scaled_per50, label_dataset)
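The perplexity parameter varied above roughly sets the effective neighborhood size t-SNE tries to preserve: small values emphasize local structure, larger ones a more global layout, and it must stay below the number of samples. A tiny self-contained sketch (synthetic data, not the dataset):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated blobs in 4-D; perplexity=5 is valid because 5 < 30 samples.
X = np.vstack([rng.normal(0.0, 0.1, (15, 4)),
               rng.normal(5.0, 0.1, (15, 4))])

# Embed into 2-D; with 30 points, perplexity around 5 keeps neighborhoods local.
emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(emb.shape)  # (30, 2)
```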
data_to_plot_pd = pd.DataFrame(numerical_tsne_transformed_3d_scaled, columns=["Component 1", "Component 2", "Component 3"])
data_to_plot_pd = pd.concat([data_to_plot_pd, pd.DataFrame(label_dataset_np_str, columns=["Label"])], axis=1, join="inner")
px.scatter_3d(data_to_plot_pd, x="Component 1", y="Component 2", z="Component 3", color="Label", title="TSNE 3D - Numeric Scaled")
data_to_plot_pd = pd.DataFrame(numerical_categorial_tsne_transformed_3d_scaled, columns=["Component 1", "Component 2", "Component 3"])
data_to_plot_pd = pd.concat([data_to_plot_pd, pd.DataFrame(label_dataset_np_str, columns=["Label"])], axis=1, join="inner")
px.scatter_3d(data_to_plot_pd, x="Component 1", y="Component 2", z="Component 3", color="Label", title="TSNE 3D - Numeric + Categorical Scaled")
5. Projections using "Projection pursuit" methodologies¶
This step is delegated to an external R script (using the tourr package), so it is more ad hoc than the rest of the notebook. Results:
Result for an outlier-based guided tour:

Result for a linear-discriminant-analysis-based (lda_pp) guided tour:

%%sh
R_LIBS_USER='~/.r' Rscript homework1.tourr.r
Converting input data to the required matrix format.
Value 1.358 17.7 % better - NEW BASIS
Using half_range 3.8
Value 1.382 1.8 % better - NEW BASIS
Value 1.383 0.0 % better
[... ~24 more tries, each at most 0.1 % better ...]
No better bases found after 25 tries. Giving up.
Final projection:
-0.707  0.158
-0.603 -0.644
-0.370  0.748
Inserting image 16 at 1.50s (100%)... Encoding to gif... done!
Converting input data to the required matrix format.
Value 0.955 20.1 % better - NEW BASIS
Using half_range 3.8
Value 0.956 0.1 % better - NEW BASIS
Value 0.956 0.0 % better
[... ~24 more tries with 0.0 % improvement ...]
No better bases found after 25 tries. Giving up.
Final projection:
-0.251 -0.750
 0.966 -0.158
-0.056  0.642
Inserting image 16 at 1.50s (100%)... Encoding to gif... done!
[1] "/home/mihai/Documents/Personal/Facultate/Master Anul 1/Data mining/Seminar/Homework 1/resources/animation2.gif"
Warning messages:
1: In render_gif(normalised_data, tour_path = guided_tour(out_index), :
  Note: only 16 frames generated, argument frames = 300 is ignored.
2: In render_gif(normalised_data, tour_path = guided_tour(lda_pp(data$NObeyesdad)), :
  Note: only 16 frames generated, argument frames = 300 is ignored.
6. Stacked histograms¶
Stacked histograms help us:
- identify inverse relationships
- reveal patterns in categorical data
- compare groups
- show percentage distributions
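The counts behind a stacked histogram can also be computed directly: one `numpy.histogram` per category over shared bin edges, then stacked by summing (roughly what `plt.hist(..., stacked=True)` does for you). A minimal sketch on hypothetical values (e.g. a Weight-like column split by a yes/no category):

```python
import numpy as np

# Hypothetical numerical values with a two-level category label.
values = np.array([52.0, 60.0, 75.0, 90.0, 105.0, 58.0, 82.0, 95.0])
labels = np.array(["no", "no", "yes", "no", "yes", "no", "yes", "no"])

bins = np.linspace(values.min(), values.max(), 4)  # 3 shared bins

# One histogram per category over the SAME bin edges; stacking = summing counts.
counts = {lab: np.histogram(values[labels == lab], bins=bins)[0]
          for lab in np.unique(labels)}
stacked_total = sum(counts.values())
print(counts, stacked_total)
```

These per-category count arrays are exactly the `weights` a stacked bar/hist plot would draw on top of each other.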
for numerical_var, categorical_var, _ in best_anova_test:
    numerical_var_data = DatasetManager.obj_list_to_np_array_numeric(dataset_obj_list, [numerical_var])
    categorical_var_data = DatasetManager.obj_list_to_np_array(dataset_obj_list, [categorical_var])
    df = pd.DataFrame(data=np.hstack((numerical_var_data, categorical_var_data)), columns=[numerical_var, categorical_var])
    df[numerical_var] = df[numerical_var].apply(pd.to_numeric)
    df.pivot(columns=categorical_var).plot(kind="hist", stacked=True)
    plt.xlabel(numerical_var)
for attribute_a, attribute_b in DatasetManager.make_combinations(CATEGORICAL_VARIABLES):
    [dataset_attribute_a, dataset_attribute_b] = DatasetManager.obj_list_to_list(
        dataset_obj_list, [attribute_a, attribute_b]
    )
    frequencies_a = calculate_frequency_of_data_categorical(dataset_attribute_a)
    frequencies_b = calculate_frequency_of_data_categorical(dataset_attribute_b)
    keys_a, values_a = list(frequencies_a.keys()), list(frequencies_a.values())
    keys_b, values_b = list(frequencies_b.keys()), list(frequencies_b.values())
    # Only compare attributes that share the same category set (e.g. yes/no pairs).
    if keys_a != keys_b:
        continue
    index = np.arange(len(keys_a))
    plt.figure()
    plt.hist(
        [index, index],
        bins=len(values_a),
        weights=[values_a, values_b],
        color=["lightblue", "#FF7F7F"],
        edgecolor="black",
        alpha=0.7,
        align="mid",
        stacked=True,
        label=[attribute_a, attribute_b],
    )
    plt.title(f'Histogram for categorical variables "{attribute_a}" - "{attribute_b}"')
    plt.xlabel("Categories")
    plt.ylabel("Count")
    plt.xticks(index, keys_a)
    plt.legend()
    plt.show()
7. Conditional Boxplots¶
Conditional boxplots help us:
- identify relationships between numerical and categorical variables
- explain the variance of a numerical variable across categories
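The numbers a conditional boxplot summarizes per category are the five-number summary (whiskers are usually clipped to 1.5·IQR rather than the true min/max, so this is an approximation). A sketch with `np.percentile` on hypothetical Weight values grouped by label:

```python
import numpy as np

# Hypothetical Weight values grouped by an obesity label.
groups = {
    "Normal_Weight":  np.array([55.0, 60.0, 62.0, 58.0, 64.0]),
    "Obesity_Type_I": np.array([95.0, 100.0, 92.0, 104.0, 98.0]),
}

# Five-number summary per category: min, Q1, median, Q3, max.
summary = {name: np.percentile(vals, [0, 25, 50, 75, 100])
           for name, vals in groups.items()}
for name, (lo, q1, med, q3, hi) in summary.items():
    print(f"{name}: min={lo} Q1={q1} median={med} Q3={q3} max={hi}")
```

Non-overlapping boxes (here, medians 60 vs 98 with tight spreads) are what makes a label "separate the numerical attribute well".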
for numerical_var, categorical_var, _ in best_anova_test:
    numerical_var_data = DatasetManager.obj_list_to_np_array_numeric(dataset_obj_list, [numerical_var])
    categorical_var_data = DatasetManager.obj_list_to_np_array(dataset_obj_list, [categorical_var])
    df = pd.DataFrame(data=np.hstack((numerical_var_data, categorical_var_data)), columns=[numerical_var, categorical_var])
    df[numerical_var] = df[numerical_var].apply(pd.to_numeric)
    ax = sns.boxplot(data=df, x=numerical_var, y=categorical_var)
    plt.title(f'Boxplot for "{numerical_var} - {categorical_var}"')
    plt.show()
    plt.clf()
8. Combined Scatterplots¶
Combined scatterplots help us:
- visualize the relationship between two numerical variables
- see whether the data is separable using just those two numerical variables
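One rough way to quantify what the eye judges in such scatterplots is to compare the distance between class centroids to the average within-class spread; a large ratio means the classes form visibly separate clouds. This heuristic is illustrative only, not part of the notebook's pipeline, and the two classes below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical classes in (Height, Weight) space.
a = rng.normal([1.60, 60.0], [0.05, 5.0], (50, 2))
b = rng.normal([1.80, 95.0], [0.05, 5.0], (50, 2))

# Centroid distance relative to average within-class spread: the larger the
# ratio, the easier the two classes are to tell apart in a scatterplot.
centroid_gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
spread = (a.std(axis=0).mean() + b.std(axis=0).mean()) / 2
print(centroid_gap / spread)
```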
label_dataset_np_str = DatasetManager.obj_list_to_np_array(dataset_obj_list, [LABEL_VARIABLE])
for numerical_var_lh, numerical_var_rh, _ in best_numerical_correlations:
    numerical_var_data_lh = DatasetManager.obj_list_to_np_array_numeric(dataset_obj_list, [numerical_var_lh])
    numerical_var_data_rh = DatasetManager.obj_list_to_np_array_numeric(dataset_obj_list, [numerical_var_rh])
    df = pd.DataFrame(data=np.hstack((numerical_var_data_lh, numerical_var_data_rh)), columns=[numerical_var_lh, numerical_var_rh])
    df = pd.concat([df, pd.DataFrame(label_dataset_np_str, columns=["Label"])], axis=1, join="inner")
    sns.scatterplot(data=df, x=numerical_var_lh, y=numerical_var_rh, hue="Label")
    plt.title(f'Scatterplot for "{numerical_var_lh} - {numerical_var_rh}"')
    plt.show()
    plt.clf()
9. Corrgrams¶
def plot_variable_correlation(correlation_title: str, correlation_bar: str, correlation_map: dict):
    variables = list(set([var[0] for var in correlation_map] + [var[1] for var in correlation_map]))
    result_matrix = np.full(shape=(len(variables), len(variables)), fill_value=0.0, dtype=np.float32)
    for it_ in range(len(variables)):
        for jt_ in range(len(variables)):
            if it_ == jt_:
                continue
            it_var = variables[it_]
            jt_var = variables[jt_]
            # The correlation map stores each unordered pair once, so try both orders.
            if (it_var, jt_var) in correlation_map:
                key = (it_var, jt_var)
            else:
                key = (jt_var, it_var)
            result_matrix[it_, jt_] = correlation_map.get(key, 0)
    # Standardize columns so statistics on very different scales remain visible,
    # then zero the diagonal (self-correlation carries no information here).
    result_matrix = StandardScaler().fit_transform(result_matrix)
    for it_ in range(len(variables)):
        result_matrix[it_, it_] = 0
    plt.figure()
    plt.imshow(result_matrix, cmap="hot")
    plt.colorbar(label=correlation_bar)
    plt.title(correlation_title)
    plt.xticks(range(len(variables)), variables, rotation=45, ha='right')
    plt.yticks(range(len(variables)), variables)
    plt.show()
plot_variable_correlation("Numeric - Numeric correlation matrix", "Pearson correlation", numeric_numeric_correlation_indexes)
plot_variable_correlation("Categorical - Categorical correlation matrix", "CHI2 Test", categorial_categorial_correlation_indexes)
plot_variable_correlation("Numerical - Categorical correlation matrix (T-Test)", "T Test", categorial_numeric_t_correlation_indexes)
plot_variable_correlation("Numerical - Categorical correlation matrix (ANOVA)", "ANOVA Test", categorial_numeric_anova_correlation_indexes)
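For the numeric–numeric case specifically, the matrix fed into such a corrgram can also be obtained in one call with `np.corrcoef`, without maintaining a pair dictionary. A sketch on synthetic columns (the variables below are made up, with weight constructed to correlate with height):

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(1.7, 0.1, 100)
weight = 100 * height + rng.normal(0, 5, 100)  # correlated with height by design
age = rng.normal(24, 6, 100)                   # independent of both

# np.corrcoef treats each row as a variable and returns the full Pearson
# correlation matrix -- the raw material of a corrgram heatmap.
corr = np.corrcoef(np.vstack([height, weight, age]))
print(np.round(corr, 2))
```

The matrix is symmetric with a unit diagonal, so a heatmap of it needs none of the key-lookup bookkeeping used above.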